Midterm Project: Affects of Air Pollution on Countries
1. Introduction
Motivation
Air pollution affects blah blah blah… and with the increased worsening in climate and air quality, blah blah blah…. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s progress can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?
Set Up
Before we start, we need to ensure that we have all the relevant libraries installed and imported.
Run these in the console to install packages in addition to the ezids package.
install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
2. Data Sources and Data Wrangling
Data Sources
For our analysis, we will be working with 5 main data sources shown in the table below:
| Data | Source | Link |
|---|---|---|
| Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle | Link |
| GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank | Link |
| United Nations Population and Region Data | United Nations | Link |
| United Nations ISO-alpha3 code | United Nations | Link |
| spData for Map Geometries | spData for Mapping | Link |
The main features in our datasets will include:
| Feature | Data Type | Unit of Measure | Notes and Assumptions |
|---|---|---|---|
| GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s progress. |
| Population Size | Numerical, Continuous | thousands of people | Annual UN estimated |
| Deaths due to Air Pollution | Numerical, Continuous | deaths per million | This is our chosen proxy for measuring the negative affects of air pollution. |
| Country | Qualitative, Categorical | N/A | 231 countries |
| SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification. |
| Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification. |
| ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID). |
| ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID). |
| M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID). |
| Year | Numerical, Categorical | N/A | 1990 to 2017 |
| GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated). |
Data Wrangling
While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.
- Fig 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
- Fig 4: Sample screenshot of transformed UN dataset.
3. Load, Clean, and Inspect Data
Load Data
country_codes_df <- read.csv("data/country_codes.csv", header = TRUE, sep = ",")
air_pollution_df <- read.csv("data/death-rates-from-air-pollution.csv", header = TRUE, sep = ",")
gdp_df <- read.csv("data/GDP_annual_growth.csv", header = TRUE, sep = ",")
population_region_df <- read.csv("data/population_in_thousands_region.csv", header = TRUE, sep = ",")str(country_codes_df)str(air_pollution_df)str(gdp_df)str(population_region_df)str(world)Clean Data
First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).
Clean air_pollution_df:
# Remove columns
air_pollution_df_cleaned <- subset(air_pollution_df, select = -c(Indoor.air.pollution..deaths.per.100.000., Outdoor.particulate.matter..deaths.per.100.000., Outdoor.ozone.pollution..deaths.per.100.000.))
# Set datatypes
air_pollution_df_cleaned$Entity = factor(air_pollution_df_cleaned$Entity)
air_pollution_df_cleaned$Code = factor(air_pollution_df_cleaned$Code)
air_pollution_df_cleaned$Year = factor(air_pollution_df_cleaned$Year)
# Rename columns
air_pollution_df_cleaned <- rename(air_pollution_df_cleaned, Country = Entity, ISO.alpha3.code = Code, Deaths.Air.Pollution.per.100k = Air.pollution..total...deaths.per.100.000.)
str(air_pollution_df_cleaned)Clean gdp_df:
# Remove columns
gdp_df_cleaned <- subset(gdp_df, select = -c(Indicator.Name, Indicator.Code))
# Pivot wide to long dataset (data melt)
gdp_df_cleaned <- gdp_df_cleaned %>%
pivot_longer(
cols = starts_with("X"),
names_to = "Year",
values_to = "GDP.USD",
values_drop_na = TRUE
)
# Remove characters from column
gdp_df_cleaned$Year<-gsub("X","",as.character(gdp_df_cleaned$Year))
# Set datatypes
gdp_df_cleaned$Country.Name = factor(gdp_df_cleaned$Country.Name)
gdp_df_cleaned$Country.Code = factor(gdp_df_cleaned$Country.Code)
gdp_df_cleaned$Year = factor(gdp_df_cleaned$Year)
# Rename columns
gdp_df_cleaned <- rename(gdp_df_cleaned, Country = Country.Name, ISO.alpha3.code = Country.Code)
str(gdp_df_cleaned)Clean population_region_df:
# Remove columns
population_region_df_cleaned <- subset(population_region_df, select = -c(Notes, Type, Parent.code))
# Pivot wide to long dataset (data melt)
population_region_df_cleaned <- population_region_df_cleaned %>%
pivot_longer(
cols = starts_with("X"),
names_to = "Year",
values_to = "Population.thousands",
values_drop_na = TRUE
)
# Remove characters from column
population_region_df_cleaned$Year<-gsub("X","",as.character(population_region_df_cleaned$Year))
population_region_df_cleaned <- as.data.frame(apply(population_region_df_cleaned, 2, function(x) gsub("\\s+", "", x)))
# Set datatypes
population_region_df_cleaned$Country = factor(population_region_df_cleaned$Country)
population_region_df_cleaned$SDGRegion = factor(population_region_df_cleaned$SDGRegion)
population_region_df_cleaned$Country.code = factor(population_region_df_cleaned$Country.code)
population_region_df_cleaned$SubRegion = factor(population_region_df_cleaned$SubRegion)
population_region_df_cleaned$Year = factor(population_region_df_cleaned$Year)
population_region_df_cleaned$Population.thousands = as.numeric(population_region_df_cleaned$Population.thousands)
# Rename columns
population_region_df_cleaned <- rename(population_region_df_cleaned, M49.code = Country.code)
str(population_region_df_cleaned)Clean population_region_df:
# Set datatypes
country_codes_df$M49.code = factor(country_codes_df$M49.code)
country_codes_df$Country.or.Area = factor(country_codes_df$Country.or.Area)
country_codes_df$ISO.alpha3.code = factor(country_codes_df$ISO.alpha3.code)
country_codes_df$ISO.alpha2.code = factor(country_codes_df$ISO.alpha2.code)
str(country_codes_df)Clean world:
# Set datatypes
world$iso_a2 = factor(world$iso_a2)
# Remove columns
world_df_cleaned <- subset(world, select = c(iso_a2, geom))
str(world_df_cleaned)Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.
Final DataFrame Construction
Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).
# Join datasets
final_df <- merge(x = air_pollution_df_cleaned, y = country_codes_df, by = 'ISO.alpha3.code')
final_df <- merge(x = final_df, y = gdp_df_cleaned, by = c('ISO.alpha3.code', 'Year'))
final_df <- merge(x = final_df, y = population_region_df_cleaned, by = c('M49.code', 'Year'))
final_df <- merge(x = final_df, y = world_df_cleaned, by.x = 'ISO.alpha2.code', by.y = 'iso_a2', all.x = TRUE) #left join
# Remove columns
final_df <- subset(final_df, select = -c(Country.or.Area, Country.y, Country))
# Calculate GDP per capita
final_df$gdp.per.capita <- final_df$GDP.USD / final_df$Population.thousands
str(final_df)Our dataset is finally ready to be analyzed.
4. EDA - Exploratory Data Analysis
Quick Plots
Let’s start our EDA process by just looking at some quick plots to look at the distribution of data.
Histogram of Numerical Features
ggplot() + geom_histogram(data = final_df, aes(x = Deaths.Air.Pollution.per.100k)) +
labs(title = "Distribution of Deaths per 100k from Air Polution")ggplot() + geom_histogram(data = final_df, aes(x = Population.thousands)) +
labs(title = "Distribution of Population")ggplot() + geom_histogram(data = final_df, aes(x = gdp.per.capita)) +
labs(title = "Distribution of GDP per Capita")- Fig 5,6,7: Histogram of numerical features.
Looks like deaths.air.pollution.per.100k, population, and gdp.per.capita are not normal and are all right skewed.
Boxplot of Numerical Features
Let’s look at a boxplot for the outliers.
ggplot(data = final_df,aes(x = SDGRegion, y=Deaths.Air.Pollution.per.100k)) +
geom_boxplot() +
xlab("SDG Region") +
ylab("Deaths per 100k from Air Polution") +
labs(title = "BoxPlot of Deaths per 100k from Air Pollution vs SDG Region") +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 5))- Fig 8: Boxplot of Deaths per 100k from Air Pollution vs SDG Region
Interesting to note that Australia/New Zealand, Europe, North America seem to have the lowest deaths per 100k from air pollution and are all fairly compactly packed together (low variance) compared to other regions around the world. Let’s take another look but at SubRegions.
ggplot(data = final_df,aes(x = SubRegion, y=Deaths.Air.Pollution.per.100k)) +
geom_boxplot() +
xlab("Sub Regions") +
ylab("Deaths per 100k from Air Polution") +
labs(title = "BoxPlot of Deaths per 100k from Air Pollution vs Sub Regions") +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))- Fig 9: Boxplot of Deaths per 100k from Air Pollution vs Sub Region
Separating out into an even granular grouping of regions show some trends where Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k and have low variance. Historically, these regions consist of countries that have been considered ‘First World’ before our first year of analysis of 1990. We will dig into this more later in our SMART questions.
What do the GDP per capita of these regions look like comparatively? Let’s take a look.
ggplot(data = final_df,aes(x = SubRegion, y=gdp.per.capita)) +
geom_boxplot() +
xlab("Sub Regions") +
ylab("GDP per Capita (USD per person)") +
labs(title = "BoxPlot of GDP per Capita vs Sub Regions") +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4))- Fig 10: Boxplot of GDP per Capita vs Sub Region
Interesting to observe that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively. We will try to see if we can quantify this relationship later on in our main research analysis.
Map of Countries
Plotting maps and maps with intensities will be useful for us to visualize our data and the results of our analysis.
# Convert to sf object so we can plot it
final_df_sf = st_as_sf(final_df)
# Choose relevant columns
final_df_sf_regions <- subset(final_df_sf, select = c(SDGRegion, SubRegion))
final_df_sf_intensities <- subset(final_df_sf, select = c(gdp.per.capita, Deaths.Air.Pollution.per.100k, Population.thousands))plot(final_df_sf_regions)- Fig 11: Global Map of SDGRegions and SubRegions
plot(final_df_sf_intensities)- Fig 12: Global Intensity Map of Key Numerical Features, 1990 to 2017
Looks like some inverse correlation between gdp.per.capita and deaths.air.pollution.per.100k.
We can also use ggplot2 to have a bit more control over map plotting.
g1 = ggplot() + geom_sf(data = final_df_sf, aes(fill = Deaths.Air.Pollution.per.100k), color = 'black') + scale_fill_distiller(palette = "YlGnBu") + coord_sf(crs = st_crs(4283))
g1- Fig 13: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
g2 = ggplot() + geom_sf(data = final_df_sf[final_df_sf$SDGRegion == 'EASTERNANDSOUTH-EASTERNASIA' & final_df_sf$Year == 2017, ], aes(fill = Deaths.Air.Pollution.per.100k), color = 'black') + scale_fill_distiller(palette = "YlGnBu") + coord_sf(crs = st_crs(4283))
g2- Fig 14: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017
SMART Questions
1. Is there a relationship between population size and pollution?
3. Which years have the lowest and highest pollution?
str(air_pollution_df)# select what we need
air_pollution_dfneed <- subset(air_pollution_df, select = -c(Code))# make Entity and year factor
air_pollution_dfneed$Entity = factor(air_pollution_dfneed$Entity)
air_pollution_dfneed$Year = factor(air_pollution_dfneed$Year)# sum up the death of each year
sum_total <- aggregate(air_pollution_dfneed$Air.pollution..total...deaths.per.100.000. , by=list(Year=air_pollution_dfneed$Year), FUN=sum)
names(sum_total)[names(sum_total) == 'x'] <- 'air_pollution_total'
sum_indoor <- aggregate(air_pollution_dfneed$Indoor.air.pollution..deaths.per.100.000. , by=list(Year=air_pollution_dfneed$Year), FUN=sum)
names(sum_indoor)[names(sum_indoor) == 'x'] <- 'air_pollution_indoor'
sum_outdoorP <- aggregate(air_pollution_dfneed$Outdoor.particulate.matter..deaths.per.100.000. , by=list(Year=air_pollution_dfneed$Year), FUN=sum)
names(sum_outdoorP)[names(sum_outdoorP) == 'x'] <- 'air_pollution_outdoorP'
sum_outdoorO <- aggregate(air_pollution_dfneed$Outdoor.ozone.pollution..deaths.per.100.000. , by=list(Year=air_pollution_dfneed$Year), FUN=sum)
names(sum_outdoorO)[names(sum_outdoorO) == 'x'] <- 'air_pollution_outdoorO'
# Merge the new variables by year
total1 <- merge(sum_total,sum_indoor, by="Year")
total2 <- merge(sum_outdoorO,sum_outdoorP, by="Year")
total3 <- merge(total1,total2, by="Year")
# Make graph using year as X with total death
library(ggplot2)
# total
ggplot(total3, aes(x=Year, y=air_pollution_total)) +
geom_bar(stat='identity', position='dodge')# indoor
ggplot(total3, aes(x=Year, y=air_pollution_indoor)) +
geom_bar(stat='identity', position='dodge')# outdoor particulate
ggplot(total3, aes(x=Year, y=air_pollution_outdoorP)) +
geom_bar(stat='identity', position='dodge')# outdoor ozone
ggplot(total3, aes(x=Year, y=air_pollution_outdoorO)) +
geom_bar(stat='identity', position='dodge')# The max and min years
special1.1 <- sum_total[which.max(sum_total$air_pollution_total),]
special1.2 <- sum_total[which.min(sum_total$air_pollution_total),]
special1 <- rbind(special1.1, special1.2) # Max and min of total
special2.1 <- sum_indoor[which.max(sum_indoor$air_pollution_indoor),]
special2.2 <- sum_indoor[which.min(sum_indoor$air_pollution_indoor),]
special2 <- rbind(special2.1, special2.2) # Max and min of indoor
special3.1 <- sum_outdoorP[which.max(sum_outdoorP$air_pollution_outdoorP),]
special3.2 <- sum_outdoorP[which.min(sum_outdoorP$air_pollution_outdoorP),]
special3 <- rbind(special3.1, special3.2) # Max and min of ondoorP
special4.1 <- sum_outdoorO[which.max(sum_outdoorO$air_pollution_outdoorO),]
special4.2 <- sum_outdoorO[which.min(sum_outdoorO$air_pollution_outdoorO),]
special4 <- rbind(special4.1, special4.2) # Max and min of ondoorO
library("ggplot2")
# Year of max and min of total pollution
ggplot(special1, aes(x=Year, y=air_pollution_total)) +
geom_bar(stat='identity', position='dodge')# Year of max and min of indoor pollution
ggplot(special2, aes(x=Year, y=air_pollution_indoor)) +
geom_bar(stat='identity', position='dodge')# Year of max and min of ourdoor particulate pollution
ggplot(special3, aes(x=Year, y=air_pollution_outdoorP)) +
geom_bar(stat='identity', position='dodge')# Year of max and min of outdoor ozone pollution
ggplot(special4, aes(x=Year, y=air_pollution_outdoorO)) +
geom_bar(stat='identity', position='dodge')4. How does air pollution increase over time? More specifically, are death rates in recent X amount of years higher than death rates from groups of X years before?
5. Main Research Question
Do lower GDP countries have more deaths per 100k due to air pollution?
Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?
Linear Fit
Let’s first look at the general fit on the overall data.
fit1 <- lm(Deaths.Air.Pollution.per.100k ~ gdp.per.capita, data=final_df_sf)
ggplot(final_df_sf, aes(gdp.per.capita, Deaths.Air.Pollution.per.100k)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
stat_regline_equation(label.y = 400, aes(label = ..eq.label..)) +
stat_regline_equation(label.y = 350, aes(label = ..rr.label..))- Fig XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.
From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.295. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.
Even looking at a each individual SDGRegion, their linear fits get better overall but are still not particularly strong with the highest being Australia/New Zealand and Europe at R2 of 0.56 and 0.55 respectively.
ggplot(final_df_sf, aes(gdp.per.capita, Deaths.Air.Pollution.per.100k)) +
geom_point(size=0.75) +
geom_smooth(method='lm') +
stat_regline_equation(label.y = 350, aes(label = ..eq.label..), size=2) +
stat_regline_equation(label.y = 300, aes(label = ..rr.label..), size=2) +
theme_grey(base_size = 7) +
facet_wrap(~SDGRegion, ncol = 3, scales='free')- Fig XX: Linear models for each SDGRegion, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
Let’s now look at how time plays a part.
ggplot(final_df_sf, aes(gdp.per.capita, Deaths.Air.Pollution.per.100k)) +
geom_point(size=0.25) +
geom_smooth(method='lm') +
stat_regline_equation(label.y = 350, aes(label = ..eq.label..), size=1.5) +
stat_regline_equation(label.y = 300, aes(label = ..rr.label..), size=1.5) +
theme_grey(base_size = 4) +
facet_wrap(~Year, ncol = 7, scales='free')- Fig XX: Linear models for each Year, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
As observed, time does not seem to play a significant part in describing the relationship between deaths due to air pollution per 100k vs GDP per capita as the R2 stays roughly constant around 0.3 across all the years.
Perhaps we should look at a non-linear fit. From our visuals, we see that every plot starts off at really high deaths due to air pollution per 100k then drops off dramatically as GDP per capita increases. However, the drop off begins to tamper off and asymptotically approaches some value. (It will be interesting to see if we can generalize what that GDP per capita value is. Let’s table that for later.) We have seen this type of behavior before in log graphs like shown below.
- Fig XX: Sample log graph.
Our data seems to be a -log(x) instead of log(x). Let’s transform our linear fit to a log fit by wrapping our features into a log() function and fitting back to a linear fit and see what the relationship is.
fit2 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita), data=final_df_sf)
summary(fit2)
# print(vif(fit2))
ggplot(final_df_sf, aes(log(gdp.per.capita), log(Deaths.Air.Pollution.per.100k))) +
geom_point(size=0.75) +
geom_smooth(method='lm') +
stat_regline_equation(label.y = 7, aes(label = ..eq.label..), size=2) +
stat_regline_equation(label.y = 6, aes(label = ..rr.label..), size=2) +
theme_grey(base_size = 7) +
facet_wrap(~SDGRegion, ncol = 3, scales='free')ggplot(final_df_sf, aes(log(gdp.per.capita), log(Deaths.Air.Pollution.per.100k))) +
geom_point(size=0.25) +
geom_smooth(method='lm') +
stat_regline_equation(label.y = 7, aes(label = ..eq.label..), size=1.5) +
stat_regline_equation(label.y = 6, aes(label = ..rr.label..), size=1.5) +
theme_grey(base_size = 4) +
facet_wrap(~Year, ncol = 7, scales='free')ggplot(final_df_sf, aes(log(gdp.per.capita), log(Deaths.Air.Pollution.per.100k))) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
stat_regline_equation(label.y = 7, aes(label = ..eq.label..)) +
stat_regline_equation(label.y = 6, aes(label = ..rr.label..))- Fig XX, XX, XX: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.
Across the board, the strength of our linear relationship increases dramatically when first transforming both features by the log() function first. The new R2 is now 0.737 which means around 74% of the variance in our target feature can be explained by this mathematical relationship.
Let’s test a few more regression models adding more features.
fit3 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion, data=final_df_sf)
fit4 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year, data=final_df_sf)
fit5 <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year, data=final_df_sf)
print(summary(fit3)$r.squared)
print(summary(fit4)$r.squared)
print(summary(fit5)$r.squared)
#print(vif(fit3))
#print(vif(fit4))
#print(vif(fit5))R2 values for adding more features are 0.883, 0.887, and 0.747.
xkablevif(fit3, title='Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year')| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 975 | 7945 | 5044 | 3147 | 9101 | 2492 | 5911 | 2988 | 2640 | 5154 | 3892 | 3611 | 5909 | 1944 | 5987 | 7157 | 3933 | 5166 | 7168 | 9144 | 9611 | 5771 | 6747 | 3902 | 2131 | 5646 | 2101 | 4744 | 2281 | 2106 | 3457 | 2936 | 3714 | 5880 | 1584 | 4427 | 5680 | 3037 | 3437 | 6334 | 5707 | 8049 | 5965 |
xkablevif(fit4, title='Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*SubRegion+Year.')| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 987 | 8010 | 5074 | 3170 | 9188 | 2516 | 5949 | 2997 | 2662 | 5201 | 3917 | 3613 | 5952 | 1951 | 6051 | 7192 | 3955 | 5204 | 7232 | 9195 | 9704 | 5774 | 1.93 | 1.94 | 1.95 | 1.96 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 | 2.02 | 2.02 | 2.05 | 2.05 | 2.06 | 2.07 | 2.08 | 2.09 | 2.11 | 2.1 | 2.11 | 2.12 | 2.12 | 2.13 | 2.13 | 2.11 | 2.1 | 2.11 | 6800 | 3922 | 2144 | 5695 | 2121 | 4771 | 2285 | 2122 | 3486 | 2952 | 3717 | 5922 | 1587 | 4473 | 5704 | 3051 | 3458 | 6390 | 5730 | 8125 | 5967 |
xkablevif(fit5, title='Fig XX: VIF of lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita)*Year.')| log(gdp.per.capita) | log(gdp.per.capita):Year1991 | log(gdp.per.capita):Year1992 | log(gdp.per.capita):Year1993 | log(gdp.per.capita):Year1994 | log(gdp.per.capita):Year1995 | log(gdp.per.capita):Year1996 | log(gdp.per.capita):Year1997 | log(gdp.per.capita):Year1998 | log(gdp.per.capita):Year1999 | log(gdp.per.capita):Year2000 | log(gdp.per.capita):Year2001 | log(gdp.per.capita):Year2002 | log(gdp.per.capita):Year2003 | log(gdp.per.capita):Year2004 | log(gdp.per.capita):Year2005 | log(gdp.per.capita):Year2006 | log(gdp.per.capita):Year2007 | log(gdp.per.capita):Year2008 | log(gdp.per.capita):Year2009 | log(gdp.per.capita):Year2010 | log(gdp.per.capita):Year2011 | log(gdp.per.capita):Year2012 | log(gdp.per.capita):Year2013 | log(gdp.per.capita):Year2014 | log(gdp.per.capita):Year2015 | log(gdp.per.capita):Year2016 | log(gdp.per.capita):Year2017 | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34.7 | 187 | 179 | 181 | 177 | 183 | 185 | 186 | 185 | 183 | 185 | 186 | 187 | 187 | 189 | 191 | 194 | 197 | 202 | 209 | 213 | 216 | 219 | 221 | 223 | 224 | 222 | 222 | 187 | 180 | 181 | 177 | 184 | 187 | 189 | 187 | 185 | 187 | 188 | 190 | 192 | 196 | 200 | 205 | 210 | 217 | 223 | 228 | 232 | 236 | 239 | 240 | 240 | 238 | 238 |
Although adding more features into our regression model results in higher R2 values, the Variance Inflation Factor (VIF) for each are extremely high so we will reject those models as those added features are highly correlated with each other. Therefore, we will stick with our second model fit2.
We can then predict a country’s deaths caused from air pollution in a given year by using the country’s GDP per capita with the following equation:
\[ log(Deaths_{from~air~pollution|per~year|per~country} / 100,000) = 10.07849 - 0.38952 * log(GDP_{per capita}) \] : Equation 1
or
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{10.07849 - 0.38952 * log(GDP per capita)} * 100,000 \] : Equation 2
Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?
Let’s test if means of deaths caused by air pollution per 100k across different GDP per capita levels are not equal. * H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp * H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal
Use \(\alpha\) of 0.05.
# divide df$am.spent
final_df_sf$qnt <- cut(final_df_sf$gdp.per.capita , breaks=quantile(final_df_sf$gdp.per.capita),
labels=1:4, include.lowest=TRUE)
# check ranges
tapply(final_df_sf$gdp.per.capita , final_df_sf$qnt , range)
deaths_aov = aov(Deaths.Air.Pollution.per.100k ~ qnt, data=final_df_sf)
deaths_aovsummary = summary(deaths_aov)
deaths_aovsummary
p = deaths_aovsummary[[1]][["Pr(>F)"]][[1]]The p-valuetest1 is 0, which is lower than \(\alpha\)0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means that there is statistically significant that at least one of the means of deaths in low, medium, and high GDP per capita are not the same.
I will conduct 4 2-sample t-tests:
- Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
I will use a two sample t-test for each and use \(\alpha\) of 0.05.
Test 1:
ttest1 = t.test(Deaths.Air.Pollution.per.100k ~ qnt==c(1,2), data=final_df_sf, conf=0.95)
ttest1p-valuetest1: 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000299
p-valuetest1 < \(\alpha\)0.05 = TRUE
Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.
Test 2:
ttest2 = t.test(Deaths.Air.Pollution.per.100k ~ qnt==c(2,3), data=final_df_sf, conf=0.95)
ttest2p-valuetest2: 0.000000000000147
p-valuetest2 < \(\alpha\)0.05 = TRUE
Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 3:
ttest3 = t.test(Deaths.Air.Pollution.per.100k ~ qnt==c(3,4), data=final_df_sf, conf=0.95)
ttest3p-valuetest3: 0
p-valuetest3 < \(\alpha\)0.05 = TRUE
Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 3:
ttest4 = t.test(Deaths.Air.Pollution.per.100k ~ qnt==c(1,4), data=final_df_sf, conf=0.95)
ttest4p-valuetest4: 0.00000291
p-valuetest4 < \(\alpha\)0.05 = TRUE
Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
6. Conclusion
From all of our tests, we can confirm that the means of deaths caused by air pollution are statistically significant when grouped by different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution has a significant relationship with GDP per capita and the strength and model can be quantified by Equation 2:
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{10.07849 - 0.38952 * log(GDP per capita)} * 100,000 \] : Equation 2